A MONTE CARLO INVESTIGATION OF EXPERIMENTAL DATA 
REQUIREMENTS FOR FITTING POLYNOMIAL FUNCTIONS 


By George C. Canavos 
Langley Research Center 

SUMMARY 

This report examines the extent to which sample size affects the accuracy of a low- 
order polynomial approximation of an experimentally observed quantity and establishes a 
trend toward improvement in the accuracy of the approximation as a function of sample 
size. The task is made possible through a simulated analysis carried out by the Monte 
Carlo method, in which data are generated by using several transcendental or algebraic 
functions as models. Contaminated data of varying amounts are fitted to linear, quadratic,
or cubic polynomials, and the behavior of the mean-squared error of the residual variance
is determined as a function of sample size. Results indicate that the effect of the size of 
the sample is significant only for relatively small sample sizes and diminishes drastically 
for moderate and large amounts of experimental data. 

INTRODUCTION 

The purpose of this report is to investigate by Monte Carlo simulation the effect 
that the number of experimental data points has on smoothing out the influence of random 
error in an analytic-function approximation of an experimentally observed quantity.

In an environment in which experimentation is the only source of information, it is 
often desired to determine and quantify the effect that some controlled variable exerts on 
a measured quantity which is nearly always subjected to random contamination. For 
many such cases, the true functional relationship is too vaguely known to be of practical 
use. Thus, some simple analytic function, such as a polynomial of relatively low degree, 
is used to approximate the behavior of the dependent variable within a prescribed range 
of the controlled variable. To determine the polynomial approximation, a reasonable 
amount of test data is needed to smooth out the effect of random error to some nominal 
value. However, in many instances the collection of laboratory data is becoming
increasingly difficult for reasons such as the cost and complexity of test equipment.
Consequently, it is advisable to plan the collection of experimental data carefully to
enhance the relevancy of each data point while holding down its cost. Nevertheless, it is
conceivable that economic restrictions on the sample size may compromise the accuracy
of the approximation to an unacceptable level. Therefore, the purpose of this report is
to shed some light on the problem of how the amount of test data affects the accuracy of
a low-order polynomial approximation of a stochastic quantity and to establish, at least
for typical cases, a trend toward improvement in the accuracy as a function of sample
size. In addition, the report summarizes some existing techniques on how the observation
points should be spaced within the range of the controlled variable to improve the
predictive capability of the approximating function.


To determine the extent to which the sample size affects the accuracy of an approx- 
imating polynomial function, a simulated analysis is carried out by appealing to Monte 
Carlo procedures (ref. 1). In order to include a practical range of possibilities, data are 
generated first by using one of several polynomial or transcendental functions as models 
and then adding random errors generated from a Gaussian distribution. However, in all 
cases the contaminated data are fitted to either linear, quadratic, or cubic polynomial 
functions. It is believed that these low-order polynomial functions are the most plausible 
to approximate the behavior of an experimentally observed quantity when compared with, 
say, high-order polynomials, which may fit the random error more than approximate the
variable quantity.


SYMBOLS

E                            expectation operator
m                            degree of polynomial function
n                            number of measurements
x                            controlled variable
X                            matrix of controlled-variable values
y                            measured variable
$\underline{y}$              vector of measurements
$\underline{\beta}$          vector of unknown coefficients
$\underline{\epsilon}$       vector of random errors
$\sigma^2$                   error variance

Superscripts:

^                            estimate
T                            transpose of matrix

An underlined symbol denotes a vector.

REVIEW OF FUNDAMENTAL CONCEPTS 

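Let the unknown functional relation between an observable quantity $y$ and a controlled
variable $x$ be approximated by a polynomial of degree $m$ in which the $j$th observation
is depicted by

$$y_j = \beta_0 + \beta_1 x_j + \beta_2 x_j^2 + \cdots + \beta_m x_j^m + \epsilon_j \qquad (j = 1, 2, \ldots, n) \tag{1}$$

where $\beta_0, \beta_1, \ldots, \beta_m$ are the unknown coefficients of the polynomial,
$\epsilon_j$ is the random error associated with the observation $y_j$, and $n$ is the
number of laboratory measurements used to fit the polynomial. In vector form, this set of
$n$ equations in $m + 1$ unknown coefficients is written as

$$\underline{y} = X\underline{\beta} + \underline{\epsilon} \tag{2}$$

where

$$\underline{y} = (y_1, y_2, \ldots, y_n)^T, \qquad
\underline{\beta} = (\beta_0, \beta_1, \ldots, \beta_m)^T, \qquad
\underline{\epsilon} = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T$$

$$X = \begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^m \\
1 & x_2 & x_2^2 & \cdots & x_2^m \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^m
\end{bmatrix}$$
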
If $E(\underline{\epsilon}) = \underline{0}$ and $E(\underline{\epsilon}\,\underline{\epsilon}^T) = \sigma^2 I$,
where $\sigma^2$ is the error variance and $I$ the appropriate identity matrix, then the
least-squares estimates (refs. 2 and 3) of the components of $\underline{\beta}$ are
determined by the result

$$\hat{\underline{\beta}} = (X^T X)^{-1} X^T \underline{y} \tag{3}$$

where $\hat{\underline{\beta}}$ denotes the vector of estimates.
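
The least-squares estimate of equation (3) can be illustrated with a minimal NumPy sketch. The function names `design_matrix` and `least_squares_coefficients` are hypothetical helpers introduced here for illustration, and the use of `numpy.linalg.lstsq` in place of an explicit matrix inverse is a numerical convenience assumed here, not a procedure taken from the report.

```python
import numpy as np

def design_matrix(x, m):
    """Return the n x (m + 1) matrix X with columns 1, x, x**2, ..., x**m."""
    return np.vander(np.asarray(x, dtype=float), m + 1, increasing=True)

def least_squares_coefficients(x, y, m):
    """Estimate the coefficients of a degree-m polynomial by least squares."""
    X = design_matrix(x, m)
    # lstsq minimizes ||X b - y||; for a full-rank X this agrees with
    # (X^T X)^{-1} X^T y while avoiding an explicit inverse.
    beta_hat, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return beta_hat

# Noise-free illustration: data from y = 1 + 2x - 3x^2 on 11 equally spaced points.
x = np.linspace(-1.0, 1.0, 11)
y = 1.0 + 2.0 * x - 3.0 * x**2
beta_hat = least_squares_coefficients(x, y, m=2)
```

With uncontaminated data the fitted coefficients reproduce the generating coefficients to within rounding error; with contaminated data they scatter about them.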

The quality of the estimates is measured by the variance-covariance matrix of
$\hat{\underline{\beta}}$ given by (ref. 2)

$$\operatorname{var}(\hat{\underline{\beta}}) = \sigma^2 (X^T X)^{-1}$$

The $(i,i)$ element of $\sigma^2 (X^T X)^{-1}$ is the variance of $\hat{\beta}_i$, while the
$(i,j)$ element corresponds to the covariance between $\hat{\beta}_i$ and $\hat{\beta}_j$
for $i \neq j$. Since variance is a measure of dispersion, the smaller the variance of a
component of $\hat{\underline{\beta}}$, the better the estimate of that component. In
real-world situations, however, the error variance $\sigma^2$ is not likely to be known.
Therefore, to compute $\operatorname{var}(\hat{\underline{\beta}})$, an estimate of
$\sigma^2$ must be determined. This is usually the unbiased estimate (ref. 2)

$$\hat{\sigma}^2 = \frac{\underline{y}^T \underline{y} - \hat{\underline{\beta}}^T X^T \underline{y}}{n - (m + 1)} \tag{4}$$



which is nothing more than the sum of squares of the residuals divided by the number of
data points less the number of estimated coefficients. Thus, the estimate $\hat{\sigma}^2$ is
usually referred to as the residual variance. Whereas the error variance $\sigma^2$ measures
the magnitude of the random error, the residual variance $\hat{\sigma}^2$ measures the
degree to which the fitted equation fails to describe the change in the dependent
variable. The residual variance is an unbiased estimate of $\sigma^2$ only if there is no
model error; otherwise $\hat{\sigma}^2$ reflects both random variation and model error. For
example, if the true model is an exponential type while the approximating function is a
polynomial, then $\hat{\sigma}^2$ accounts for the error due to inherent differences between
the true and fitted functions as well as for pure random error.
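
The statement that equation (4) is the residual sum of squares divided by $n - (m + 1)$ can be checked numerically. The sketch below is illustrative only: the function name `residual_variance_two_ways`, the seed, the sample size, and the linear test model are assumptions made here, not values from the report.

```python
import numpy as np

def residual_variance_two_ways(x, y, m):
    """Return the residual variance from equation (4) and from the residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.vander(x, m + 1, increasing=True)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    dof = len(y) - (m + 1)          # data points less estimated coefficients
    # Form used in equation (4): (y^T y - beta_hat^T X^T y) / dof
    s2_eq4 = (y @ y - beta_hat @ (X.T @ y)) / dof
    # Equivalent form: sum of squared residuals / dof
    resid = y - X @ beta_hat
    s2_resid = (resid @ resid) / dof
    return s2_eq4, s2_resid

rng = np.random.default_rng(0)       # assumed seed, for repeatability only
x = np.linspace(-1.0, 1.0, 21)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=x.size)  # error variance of 1
s2_eq4, s2_resid = residual_variance_two_ways(x, y, m=1)
```

The residual form is usually preferable in floating-point arithmetic, because the form in equation (4) subtracts two quantities of similar size.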

Of major interest in an analytic-function approximation of a variable quantity is the
ability to predict that quantity without a laboratory observation. Thus, let $x_p$ be a point
of prediction. The predicted value $\hat{y}_p$ corresponding to $x_p$, from equation (1), is

$$\hat{y}_p = \underline{x}_p^T \hat{\underline{\beta}}$$

where

$$\underline{x}_p = (1, x_p, x_p^2, \ldots, x_p^m)^T$$

From matrix algebra, the variance of $\hat{y}_p$ is

$$\operatorname{var}(\hat{y}_p) = \underline{x}_p^T (X^T X)^{-1} \underline{x}_p \, \sigma^2
= \underline{x}_p^T \operatorname{var}(\hat{\underline{\beta}}) \, \underline{x}_p$$

Therefore, the quality of $\hat{y}_p$ depends directly on the quality of the least-squares
estimates of the polynomial coefficients. Moreover, $\operatorname{var}(\hat{y}_p)$ is a
function of the residual variance $\hat{\sigma}^2$. If, for example, the residual variance is
zero, the predicted and
observed values will coincide and the fitted polynomial function will model the observed
data without error. On the other hand, an excessively large residual variance will result 
in a poor prediction capability. 
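
The variance of a predicted value can be evaluated directly from the quadratic form in the design matrix. The sketch below is a minimal illustration; the function name `prediction_variance` and the five-point linear example are assumptions chosen here so that the result can be checked by hand.

```python
import numpy as np

def prediction_variance(x, m, x_p, sigma2):
    """Return sigma2 * x_p^T (X^T X)^{-1} x_p for a degree-m polynomial fit."""
    X = np.vander(np.asarray(x, dtype=float), m + 1, increasing=True)
    xp_vec = float(x_p) ** np.arange(m + 1)      # (1, x_p, ..., x_p**m)
    # Solve (X^T X) z = x_p instead of forming the inverse explicitly.
    z = np.linalg.solve(X.T @ X, xp_vec)
    return sigma2 * (xp_vec @ z)

# Linear model on five equally spaced points in [-1, 1]:
# X^T X = diag(5, 2.5), so var = sigma2 * (1/5 + x_p**2 / 2.5).
x = np.linspace(-1.0, 1.0, 5)
var_center = prediction_variance(x, m=1, x_p=0.0, sigma2=1.0)
var_edge = prediction_variance(x, m=1, x_p=1.0, sigma2=1.0)
```

For this example the variance at the edge of the range is three times that at the center, which shows how strongly the prediction error depends on where the observation points lie relative to the prediction point.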

As stated earlier, the motivating force in an analytic-function approximation of a 
stochastic quantity is to predict the quantity without an actual measurement. Thus, it is 
imperative that the data-gathering procedure be carefully planned to control the size of 
the error between a laboratory measurement and the corresponding predicted value. In 
fact, how the observation points are spaced is related to the error of a predicted value. 

If, for example, some polynomial of unknown degree is to be tried as the approximating 
function, the optimal spacing of observation points is a uniform distribution throughout 
the selected range of the controlled variable (ref. 4). Such a spacing increases the likeli- 
hood of detection of an unusual behavior while holding down the size of the error. Alter- 
natively, if the degree of the approximating polynomial is known, spacing such as that 
considered by De la Garza (ref. 5) is preferred. Some optimal-spacing techniques in
curve fitting are summarized briefly in the appendix. 

MONTE CARLO SIMULATION 

In every Monte Carlo simulation, it is mandatory to specify completely the process 
to be simulated and to identify quantities of interest (ref. 1). It is therefore necessary 
to present a detailed discussion for implementing the simulation procedure. 
The objective is to fit a varying number of contaminated data points to one of three
models (viz., linear, quadratic, or cubic polynomial) and determine the behavior of the 
random error as a function of the number of data points. Establishing the effect that 
sample size has on smoothing out the influence of random error in a polynomial-function 
approximation is essentially the same as determining the stability of the residual variance 
as a function of sample size. A quantity which measures the stability of any estimator is 
the mean-squared error of the estimator. The mean-squared error depicts the average 
squared difference between the estimator and the quantity it is estimating. Since the 
error variance must be an input to the simulation, it is possible to determine the squared
difference between $\hat{\sigma}^2$ and $\sigma^2$, given a model and a set of data points. By simulating
and repeating such a procedure many times, the squared differences are accumulated and 
the mean-squared error of the residual variance is determined. 

Data are generated by using polynomial and transcendental functions as models and 
are uniformly distributed throughout the range of $x$ for the following 11 values of $n$:
5, 11, 21, 31, ..., 101. Odd sample sizes are selected to allow for the
inclusion of the midpoint and both extremes of the range of $x$ while maintaining uniform
spacing. 

When polynomials are used to generate data, the range of $x$ is restricted to the
interval (-1,1) for the purpose of providing reasonable control on the magnitude of the
dependent variable. Moreover, two distinct values of the error variance $\sigma^2$ are used:
1 and 225. Since the magnitude of $y$ is not likely to be excessive within the indicated
range of $x$, it is believed that these two values of $\sigma^2$ provide modest and significant
contamination to the data, respectively. Data are generated by using a polynomial function
of degree $m \le 3$ and are fitted to the same model after contamination. Thus, for
each value of $n$ and $\sigma^2$, this part of the simulation is carried out according to the
following scheme:

(1) Values for the coefficients of the polynomial are generated from the range -100
to 100.

(2) By using the generated coefficients and the appropriate $X$ matrix, $n$ uncontaminated
values of $y$ are generated.

(3) Each value of $y$ is contaminated by generating a random number from a normal
distribution with a mean of zero and a variance of $\sigma^2$.

(4) The least-squares estimates of the coefficients and the residual variance are
computed according to equations (3) and (4), respectively.

(5) The squared difference between $\hat{\sigma}^2$ and $\sigma^2$ is computed and stored.

(6) Steps (1) to (5) are repeated 500 times to determine the mean-squared error of
the residual variance.
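
Steps (1) to (6) can be sketched in a few lines of NumPy. This is an illustrative reconstruction under stated assumptions, not the program used for the report: the function name `mse_of_residual_variance` is hypothetical, the coefficients are assumed to be drawn uniformly from -100 to 100 (the report gives only the range), and the seed is arbitrary.

```python
import numpy as np

def mse_of_residual_variance(n, m, sigma2, trials=500, rng=None):
    """Monte Carlo estimate of the mean-squared error of the residual variance."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.linspace(-1.0, 1.0, n)                 # uniform spacing, ends included
    X = np.vander(x, m + 1, increasing=True)
    dof = n - (m + 1)
    squared_diffs = np.empty(trials)
    for t in range(trials):
        beta = rng.uniform(-100.0, 100.0, size=m + 1)   # step (1), uniform draw assumed
        y_pure = X @ beta                               # step (2)
        y = y_pure + rng.normal(0.0, np.sqrt(sigma2), size=n)   # step (3)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)        # step (4), equation (3)
        resid = y - X @ beta_hat
        s2 = (resid @ resid) / dof                      # step (4), equation (4)
        squared_diffs[t] = (s2 - sigma2) ** 2           # step (5)
    return squared_diffs.mean()                         # step (6)

rng = np.random.default_rng(0)                          # assumed seed
sample_sizes = [5, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101]
mse = {n: mse_of_residual_variance(n, m=2, sigma2=1.0, rng=rng) for n in sample_sizes}
```

Printing `mse` for the quadratic model shown here gives one curve of the kind plotted in the figures; changing `m` and `sigma2` gives the others.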
The results are provided in figures 1 to 3, where the behavior of the mean-squared
error of $\hat{\sigma}^2$ is given as a function of $n$ for each polynomial model and each
value of $\sigma^2$.

The second part of the computer simulation deals with generating data from trans- 
cendental functions, contaminating the data, and fitting them to either quadratic or cubic 
polynomial models. Three distinct functions are arbitrarily selected. These are 


$$y = 2\left[\exp(-2x) - \exp(-4x)\right] \qquad (0 \le x \le 2) \tag{5}$$

$$y = 2 \times 10^{4x/(100+x)} \qquad (0 \le x \le 100) \tag{6}$$

$$y = 10x \exp\left(-\sqrt{x}/2\right) \qquad (0 \le x \le 200) \tag{7}$$


where the selected range of x is indicated for each function. 

Data generated by equation (5) are fitted to both quadratic and cubic models, while 
simulated data from equations (6) and (7) are fitted to quadratic and cubic models, respec- 
tively. The error variances used to generate Gaussian noise to contaminate the data gen- 
erated by equations (5), (6), and (7) are 0.01, 81, and 9, respectively. Figures 4 to 6 are 
provided to show generated data before and after contamination for each one of equations 
(5) to (7). 
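
The effect of model error on the residual variance can be illustrated with equation (5), which is fitted to both quadratic and cubic models. The sketch below is a single illustrative draw rather than the report's simulation; the helper names, the seed, and the choice of 101 points are assumptions made here.

```python
import numpy as np

def equation_5(x):
    """Transcendental model of equation (5) on 0 <= x <= 2."""
    return 2.0 * (np.exp(-2.0 * x) - np.exp(-4.0 * x))

def residual_variance(x, y, m):
    """Residual variance of a degree-m least-squares polynomial fit."""
    X = np.vander(x, m + 1, increasing=True)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return (resid @ resid) / (len(y) - (m + 1))

rng = np.random.default_rng(1)                 # assumed seed
sigma2 = 0.01                                  # error variance used with equation (5)
x = np.linspace(0.0, 2.0, 101)
y = equation_5(x) + rng.normal(0.0, np.sqrt(sigma2), size=x.size)

s2_quadratic = residual_variance(x, y, m=2)
s2_cubic = residual_variance(x, y, m=3)
```

Because neither polynomial is the true model, both residual variances contain a model-error contribution in addition to the random error, and the quadratic fit carries the larger share.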

The implementation of the simulation scheme for data generated by transcendental 
functions is analogous to that already discussed; that is, after generating the data by using 
one of equations (5) to (7), the simulation scheme picks up with step (3) of the outlined
procedure. The results are given in figures 7 to 10, where once again the behavior of the
mean-squared error of $\hat{\sigma}^2$ is depicted as a function of $n$. On the basis of the overall
results, the following conclusions are apparent: 

(1) The behavior of the mean-squared error of $\hat{\sigma}^2$ as a function of $n$ resembles
a fast-decaying exponential curve. This result is in agreement with the expected behavior 
of statistical estimators. 

(2) In most cases, the mean-squared error is reduced dramatically as n increases 
to a moderate size. However, as n becomes larger, the mean-squared-error curves 
for nearly all cases flatten out so much that any further reduction may not be economically 
advantageous. 

(3) Within the scope of the investigation, the substance of the results appears to be 
nearly invariant with the source from which the data are simulated. The same comment 
also applies to different values of the error variance. 
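
As a point of comparison that is not taken from the report, a standard result gives the expected curve when the fitted polynomial is the true model and the errors are Gaussian: $(n - m - 1)\hat{\sigma}^2/\sigma^2$ then follows a chi-square distribution with $n - m - 1$ degrees of freedom, so the mean-squared error of the unbiased estimate $\hat{\sigma}^2$ equals its variance, $2\sigma^4/(n - m - 1)$. The sketch below tabulates that value for the 11 sample sizes; the function name `theoretical_mse` is hypothetical, and the formula does not apply when model error is present.

```python
def theoretical_mse(n, m, sigma2):
    """MSE of the residual variance for Gaussian errors and no model error."""
    dof = n - (m + 1)
    if dof <= 0:
        raise ValueError("need more data points than estimated coefficients")
    # Var(chi2_dof) = 2 * dof, so Var(sigma2 * chi2_dof / dof) = 2 * sigma2**2 / dof.
    return 2.0 * sigma2**2 / dof

sample_sizes = [5, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101]
curve = [theoretical_mse(n, m=2, sigma2=1.0) for n in sample_sizes]
```

For the quadratic model with unit error variance, the value falls from 1.0 at $n = 5$ to 0.25 at $n = 11$ and to about 0.02 at $n = 101$, which matches the steep initial drop and later flattening described in conclusions (1) and (2).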
From the results of this investigation, the effect that the sample size has on
smoothing out the influence of random error in an analytic-function approximation of a
stochastic quantity appears to be significant only for small sample sizes and diminishes 
considerably for larger values of n. Thus, the careful planning of only a moderate num- 
ber of laboratory tests appears to be most beneficial. 

CONCLUDING REMARKS 

In an environment in which decisions are based on experimentation, it is often 
desired to determine an analytic representation of an experimentally observed quantity 
as a function of some controlled variable. Such a task is usually carried out by fitting 
laboratory measurements of the quantity to some simple analytic function, such as a
polynomial of relatively low degree. However, the amount of laboratory testing is becoming
increasingly restricted, mainly for economic reasons. Consequently, the purpose
of this report has been to determine the effect of sample size on the accuracy of an 
analytic-function approximation of an experimentally observed quantity. Results obtained 
by using the Monte Carlo method indicate that for typical cases a moderate sample size 
provides an excellent trade-off between accuracy and economic restrictions. 

Langley Research Center, 

National Aeronautics and Space Administration, 

Hampton, Va., February 12, 1974. 





APPENDIX 


OPTIMAL SPACING TECHNIQUES IN CURVE FITTING 

When the degree of the polynomial function to be fitted is known, an optimal spacing 
of the independent variable has been developed by De la Garza (ref. 5). Consider the 
polynomial function 

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_m x^m$$

of known degree m. Assume that n observations of y will be made within the range 
of x, which is scaled to the interval (-1,1) for convenience. De la Garza (ref. 5) showed 
that to minimize the maximum variance of a predicted quantity y, the optimum spacing 
of the n observations is accomplished by using no more than m + 1 distinct observa- 
tion points within the range of x. The spacing for minimax variance is provided in the 
following table through the cubic polynomial function: 


Model        Observation points                       Number of measurements
                                                      per observation point

Linear       -1, 1                                    n/2
Quadratic    -1, 0, 1                                 n/3
Cubic        -1, -1/\sqrt{5}, 1/\sqrt{5}, 1           n/4




Let $x_p$ be any arbitrary point of prediction within the interval (-1,1). If the
indicated optimal spacing is used, it has also been shown (ref. 5) that the maximum variance
of $\hat{y}_p$ corresponding to the prediction point $x_p$ is

$$\frac{(m + 1)\sigma^2}{n}$$
where the residual variance $\hat{\sigma}^2$ usually replaces the unknown error variance $\sigma^2$. In
fact, given the polynomial model, the absolute minimum variance of $\hat{y}_p$ can also be
determined. As an example, consider the cubic model. The matrix $X$ which corresponds
to the minimax variance spacing when $n$ is a multiple of 4 is


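$$X = \begin{bmatrix}
1 & -1 & 1 & -1 \\
1 & -\sqrt{5}/5 & 1/5 & -\sqrt{5}/25 \\
1 & \sqrt{5}/5 & 1/5 & \sqrt{5}/25 \\
1 & 1 & 1 & 1 \\
\vdots & \vdots & \vdots & \vdots \\
1 & -1 & 1 & -1 \\
1 & -\sqrt{5}/5 & 1/5 & -\sqrt{5}/25 \\
1 & \sqrt{5}/5 & 1/5 & \sqrt{5}/25 \\
1 & 1 & 1 & 1
\end{bmatrix}$$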


Thus,

$$X^T X = n \begin{bmatrix}
1 & 0 & 3/5 & 0 \\
0 & 3/5 & 0 & 13/25 \\
3/5 & 0 & 13/25 & 0 \\
0 & 13/25 & 0 & 63/125
\end{bmatrix}$$

and

$$(X^T X)^{-1} = \frac{1}{4n} \begin{bmatrix}
13 & 0 & -15 & 0 \\
0 & 63 & 0 & -65 \\
-15 & 0 & 25 & 0 \\
0 & -65 & 0 & 75
\end{bmatrix}$$



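Therefore, if $x_p$ is the point of prediction,

$$\begin{aligned}
\operatorname{var}(\hat{y}_p) &= \underline{x}_p^T (X^T X)^{-1} \underline{x}_p \, \sigma^2 \\
&= \begin{bmatrix} 1 & x_p & x_p^2 & x_p^3 \end{bmatrix}
\begin{bmatrix}
13 & 0 & -15 & 0 \\
0 & 63 & 0 & -65 \\
-15 & 0 & 25 & 0 \\
0 & -65 & 0 & 75
\end{bmatrix}
\begin{bmatrix} 1 \\ x_p \\ x_p^2 \\ x_p^3 \end{bmatrix}
\frac{\sigma^2}{4n} \\
&= \frac{13 + 33x_p^2 - 105x_p^4 + 75x_p^6}{4n}
\end{aligned}$$
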

where $\sigma^2$ is temporarily dropped for convenience. It follows that the points at which
maximum or minimum variances occur are determined by solving the equation

$$\frac{d \operatorname{var}(\hat{y}_p)}{dx_p} = 0$$

or

$$66x_p - 420x_p^3 + 450x_p^5 = 0$$
which, upon simplification, yields the values $x_p = 0, \pm\sqrt{11/15}, \pm\sqrt{1/5}$. By examining
the second derivative, it is determined that the minimum variance for a cubic model
is $2.578\sigma^2/n$ and occurs at $x_p = \pm\sqrt{11/15}$, while the maximum variance $4\sigma^2/n$
occurs at $x_p = \pm\sqrt{1/5}$.

Assuming the spacing for minimax variance with regard to linear, quadratic, and
cubic polynomials, the following table provides upper and lower bounds of the variance
of a predicted value $\hat{y}_p$ when the prediction point is within the range -1 to 1:

Model        Variance boundaries for a predicted value

Linear       \sigma^2/n        \le var(\hat{y}_p) \le 2\sigma^2/n
Quadratic    1.875\sigma^2/n   \le var(\hat{y}_p) \le 3\sigma^2/n
Cubic        2.578\sigma^2/n   \le var(\hat{y}_p) \le 4\sigma^2/n
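
The variance bounds for the three minimax-variance spacings can be checked numerically by evaluating $n \operatorname{var}(\hat{y}_p)/\sigma^2$ on a fine grid of prediction points. This is a verification sketch under the stated spacing with equal replication at each observation point; the names `DESIGNS` and `scaled_prediction_variance` and the grid resolution are assumptions introduced here.

```python
import numpy as np

# Distinct observation points for each polynomial degree m, as in the spacing table.
DESIGNS = {
    1: np.array([-1.0, 1.0]),
    2: np.array([-1.0, 0.0, 1.0]),
    3: np.array([-1.0, -1.0 / np.sqrt(5.0), 1.0 / np.sqrt(5.0), 1.0]),
}

def scaled_prediction_variance(m, x_p):
    """Return n * var(y_hat_p) / sigma^2 for n/(m+1) measurements per point."""
    X0 = np.vander(DESIGNS[m], m + 1, increasing=True)
    # With n/(m+1) replicates per point, X^T X = (n/(m+1)) * X0^T X0,
    # so n * (X^T X)^{-1} = (m+1) * (X0^T X0)^{-1}.
    M_inv = (m + 1) * np.linalg.inv(X0.T @ X0)
    V = np.vander(np.atleast_1d(np.asarray(x_p, dtype=float)), m + 1, increasing=True)
    return np.einsum("ij,jk,ik->i", V, M_inv, V)

grid = np.linspace(-1.0, 1.0, 20001)
bounds = {}
for m in DESIGNS:
    v = scaled_prediction_variance(m, grid)
    bounds[m] = (v.min(), v.max())      # (lower, upper) in units of sigma^2 / n
```

For the cubic model the grid search reproduces the values 2.578 and 4 derived above, and the upper bound for each model equals $m + 1$, in agreement with the maximum variance $(m + 1)\sigma^2/n$.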

Figure 1.- Mean-squared error of residual variance for a linear model. [Plot not
reproduced; panel (b): $\sigma^2 = 225$; abscissa: number of data points; ordinate:
mean-squared error.]

Figure 2.- Mean-squared error of residual variance for a quadratic model.

Figure 5.- Pure and contaminated data generated by equation (6).

Figure 6.- Pure and contaminated data generated by equation (7).

Figure 7.- Mean-squared error of residual variance for data generated by equation (5)
and fitted to a quadratic polynomial.

Figure 8.- Mean-squared error of residual variance for data generated by equation (5)
and fitted to a cubic polynomial.

Figure 9.- Mean-squared error of residual variance for data generated by equation (6)
and fitted to a quadratic polynomial.

Figure 10.- Mean-squared error of residual variance for data generated by equation (7)
and fitted to a cubic polynomial.